Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems
Ensuring the reliability of cloud systems is critical for both cloud vendors
and customers. Cloud systems often rely on virtualization techniques to create
instances of hardware resources, such as virtual machines. However,
virtualization hinders the observability of cloud systems, making it
challenging to diagnose platform-level issues. To improve system observability,
we propose to infer functional clusters of instances, i.e., groups of instances
having similar functionalities. We first conduct a pilot study on a large-scale
cloud system, i.e., Huawei Cloud, demonstrating that instances having similar
functionalities share similar communication and resource usage patterns.
Motivated by these findings, we formulate the identification of functional
clusters as a clustering problem and propose a non-intrusive solution called
Prism. Prism adopts a coarse-to-fine clustering strategy. It first partitions
instances into coarse-grained chunks based on communication patterns. Within
each chunk, Prism further groups instances with similar resource usage patterns
to produce fine-grained functional clusters. Such a design reduces noise in
the data and allows Prism to process massive instances efficiently. We evaluate
Prism on two datasets collected from the real-world production environment of
Huawei Cloud. Our experiments show that Prism achieves a v-measure of ~0.95,
surpassing existing state-of-the-art solutions. Additionally, we illustrate the
integration of Prism within monitoring systems for enhanced cloud reliability
through two real-world use cases.
Comment: The paper was accepted by the 38th IEEE/ACM International Conference
on Automated Software Engineering (ASE 2023).
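The coarse-to-fine strategy described in the abstract can be illustrated with a minimal sketch. This is not Prism's actual algorithm: the abstract does not specify how either stage is implemented. The sketch assumes that the coarse stage groups instances by connected components of a communication graph, and that the fine stage greedily groups instances whose resource usage series have a Pearson correlation above a threshold. All function names, the threshold value, and the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def connected_components(nodes, edges):
    """Coarse stage (assumed): instances that communicate, directly or
    transitively, end up in the same chunk. Uses a simple union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    chunks = defaultdict(list)
    for n in nodes:
        chunks[find(n)].append(n)
    return list(chunks.values())


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def fine_clusters(chunk, usage, threshold=0.9):
    """Fine stage (assumed): an instance joins the first cluster whose
    first member has a sufficiently correlated usage series."""
    clusters = []
    for inst in chunk:
        for cluster in clusters:
            if pearson(usage[inst], usage[cluster[0]]) >= threshold:
                cluster.append(inst)
                break
        else:
            clusters.append([inst])
    return clusters


def coarse_to_fine(nodes, edges, usage, threshold=0.9):
    """Run the fine stage independently inside every coarse chunk."""
    result = []
    for chunk in connected_components(nodes, edges):
        result.extend(fine_clusters(chunk, usage, threshold))
    return result


# Hypothetical toy data: five instances, their communication links,
# and one resource usage series (e.g. CPU) per instance.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
usage = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
    "d": [1, 5, 1, 5],
    "e": [2, 6, 2, 6],
}
clusters = coarse_to_fine(nodes, edges, usage)
print(clusters)
```

Because the fine stage only compares instances within the same chunk, the number of pairwise comparisons stays small even when the total number of instances is large, which is consistent with the efficiency argument made in the abstract.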
Performance Issue Identification in Cloud Systems with Relational-Temporal Anomaly Detection
Performance issues permeate large-scale cloud service systems, which can lead
to huge revenue losses. To ensure reliable performance, it's essential to
accurately identify and localize these issues using service monitoring metrics.
Given the complexity and scale of modern cloud systems, this task can be
challenging and may require extensive expertise and resources beyond the
capacity of individual humans. Some existing methods tackle this problem by
analyzing each metric independently to detect anomalies. However, this could
incur overwhelming alert storms that are difficult for engineers to diagnose
manually. To pursue better performance, not only the temporal patterns of
metrics but also the correlation between metrics (i.e., relational patterns)
should be considered, which can be formulated as a multivariate metrics anomaly
detection problem. However, most existing studies fall short of extracting these
two types of features explicitly. Moreover, there exist some unlabeled
anomalies mixed in the training data, which may hinder the detection
performance. To address these limitations, we propose the Relational-Temporal
Anomaly Detection Model (RTAnomaly), which combines the relational and temporal
information of metrics. RTAnomaly employs a graph attention layer to learn the
dependencies among metrics, which in turn helps to effectively pinpoint the
anomalous metrics that may cause the anomaly. In addition, we exploit the
concept of positive unlabeled learning to address the issue of potential
anomalies in the training data. To evaluate our method, we conduct experiments
on a public dataset and two industrial datasets. RTAnomaly outperforms all the
baseline models by achieving an average F1 score of 0.929 and Hit@3 of 0.920,
demonstrating its superiority.
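The two reported metrics can be made concrete with a short sketch. The abstract does not define them, so this assumes the common definitions: F1 as the harmonic mean of precision and recall over binary anomaly labels, and Hit@k as the fraction of anomaly cases whose true anomalous metric appears among the top k ranked candidates. The function names and toy data are hypothetical and are not taken from the paper.

```python
def f1_score(y_true, y_pred):
    """F1 over binary labels (1 = anomaly, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def hit_at_k(ranked_metrics, true_metrics, k=3):
    """Fraction of cases whose true anomalous metric is in the top-k ranking."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_metrics, true_metrics)
        if truth in ranked[:k]
    )
    return hits / len(true_metrics)


# Hypothetical detection results for six time points.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
f1 = f1_score(y_true, y_pred)

# Hypothetical localization results for three anomaly cases:
# each list ranks metrics from most to least suspicious.
ranked = [
    ["cpu", "mem", "disk", "net"],
    ["net", "cpu", "mem", "disk"],
    ["disk", "net", "cpu", "mem"],
]
truth = ["mem", "disk", "disk"]
hit3 = hit_at_k(ranked, truth, k=3)

print(f"F1 = {f1:.3f}, Hit@3 = {hit3:.3f}")
```

In this toy run, detection has two true positives, one false positive, and one false negative, and the localization ranking places the true metric in the top three for two of the three cases.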